Many programming languages and tools, ranging from grep to the Java String library, contain regular
expression matchers. Rather than first translating a regular expression into a deterministic finite 
automaton, such implementations typically match the regular expression on the fly. Thus they can be 
seen as virtual machines interpreting the regular expression much as if it were a program with some 
non-deterministic constructs such as the Kleene star. We formalize this implementation technique for 
regular expression matching using operational semantics. Specifically, we derive a series of abstract 
machines, moving from the abstract definition of matching to increasingly realistic machines. First 
a continuation is added to the operational semantics to describe what remains to be matched after 
the current expression. Next, we represent the expression as a data structure using pointers, which 
enables redundant searches to be eliminated via testing for pointer equality. From there, we arrive 
both at Thompson's lockstep construction and a machine that performs some operations in parallel, 
suitable for implementation on a large number of cores, such as a GPU. We formalize the parallel 
machine using process algebra and report some preliminary experiments with an implementation on 
a graphics processor using CUDA. 

1 Introduction 

Regular expressions form a minimalistic language of pattern-matching constructs. Originally defined in 
Kleene's work on the foundations of computation, they have become ubiquitous in computing. Their 
practical significance was boosted by Thompson's efficient construction [13] of a regular expression
matcher based on the "lockstep" simulation of a Non-deterministic Finite Automaton (NFA), and the 
wide use of regular expressions in Unix tools such as grep and awk. 

The regular expression matchers used in such tools differ in detail from the implementation of reg- 
ular expressions used in compiler construction for lexical analysis. In compiling, lexical analyzers are 
typically built by constructing a Deterministic Finite Automaton (DFA), using one of the standard results 
of automata theory. The DFA can process input very efficiently, but its construction incurs an additional 
overhead before any input can be matched. Moreover, the DFA construction only works if the matching 
language really is a regular language, so that it can be recognized by a DFA. Many matching languages 
add constructs that take the language beyond what a DFA can recognize, for instance back references. 
(By abuse of terminology, such extended languages are sometimes still referred to as "regexes".) 

Recently, Cox [5] has given a rational reconstruction of Thompson's classic NFA matcher in terms
of virtual machines. In essence, a regular expression is interpreted on the fly, much as a program in 
an interpreted programming language. The interpreter is a kind of virtual machine, with a small set of 
instructions suitable for running regular expressions. For instance, the Kleene star e* gives a form of 
non-deterministic loop. Cox emphasizes that the virtual machine approach in the style of Thompson is 
both flexible and efficient. Once a basic virtual machine for regular expressions is set up, other constructs 
such as back-references can be added with relative ease. Moreover, the machine is much more efficient 
than other implementation techniques based on a more naive backtracking interpreter [4], which exhibit
exponential run-time in some cases. Surprisingly, these inefficient matchers are widely used in Java and
Perl [4].

In this paper, we formalize the view of regular expression matchers as machines by using tools from 
programming language theory, specifically operational semantics. We do so starting from the usual 
definition of regular expressions and their meaning, and then defining increasingly realistic machines. 

We first define some preliminaries and recall what it means for a string to match a regular expression
in Section 2. From our perspective, matching is a simple form of big-step semantics, and we aim to
refine it into a small-step semantics. To do so, in Section 3 we introduce a distinction between a current
expression and its continuation. We then refine this semantics by representing the regular expression
as a syntax tree using pointers in memory (Section 4). Crucially, the pointer representation allows us
to compare sub-expressions by pointer equality (rather than structurally). This pointer equality test is
needed for the efficient elimination of redundant match attempts, which underlies the general lockstep
NFA simulation presented in Section 5. We recover Thompson's machine as a sequential implementation
of the lockstep construction (Section 6). Since the lockstep construction involves simulating many
non-deterministic machines in parallel, we then explore a parallel version using some simple process algebra
in Section 7. The parallel process semantics is then related to a prototype implementation we have written
in CUDA to run on a Graphics Processing Unit (GPU) in Section 8. Section 9 concludes with some
future directions. The overall plan of the paper can be visualised as follows:

    Regular expression matching as big-step semantics (Sec. 2)
        |  Small step with continuations
        v
    EKW machine (Sec. 3)
        |  Pointer representation
        v
    PWπ machine (Sec. 4)
        |  Macro steps
        v
    Generic lockstep construction (Sec. 5)
        |  Sequential scheduling          |  Parallel scheduling
        v                                 v
    Sequential matcher (Sec. 6)       Parallel matcher (Sec. 7)
                                          |  Processes as threads in CUDA
                                          v
                                      Implementation on Graphics Processor (Sec. 8)



2 Regular expression matching as a big-step semantics 



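Let $\Sigma$ be a finite set, regarded as the input alphabet. We use the following abstract syntax for regular
expressions:

\[
\begin{array}{rcll}
e & ::= & \varepsilon \\
e & ::= & a & \text{where } a \in \Sigma \\
e & ::= & e^* \\
e & ::= & e_1\, e_2 \\
e & ::= & e_1 \mid e_2
\end{array}
\]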



We let e range over regular expressions, a over characters, and w over strings of characters. The
empty string is written as $\varepsilon$. Note that there is also a regular expression constant $\varepsilon$. We also write the
sequential composition $e_1 e_2$ as $e_1 \bullet e_2$ when we want to emphasise it as the occurrence of an operator
applied to $e_1$ and $e_2$, for instance in a syntax tree. For strings $w_1$ and $w_2$, we write their concatenation as
juxtaposition $w_1 w_2$. A single character a is also regarded as a string of length 1.

Our starting point is the usual definition of what it means for a string w to match a regular expression
e. We write this relation as $e \downarrow w$, regarding it as a big-step operational semantics for a language with
non-deterministic branching $e_1 \mid e_2$ and a non-deterministic loop $e^*$. The rules are given in Figure 2.1.

\[
\frac{e_1 \downarrow w_1 \qquad e_2 \downarrow w_2}{(e_1\, e_2) \downarrow (w_1 w_2)}\ (\textsc{Seq})
\qquad
\frac{}{a \downarrow a}\ (\textsc{Match})
\qquad
\frac{}{\varepsilon \downarrow \varepsilon}\ (\textsc{Epsilon})
\]
\[
\frac{e \downarrow w_1 \qquad e^* \downarrow w_2}{e^* \downarrow (w_1 w_2)}
\qquad
\frac{}{e^* \downarrow \varepsilon}
\]
\[
\frac{e_1 \downarrow w}{(e_1 \mid e_2) \downarrow w}\ (\textsc{Alt1})
\qquad
\frac{e_2 \downarrow w}{(e_1 \mid e_2) \downarrow w}\ (\textsc{Alt2})
\]

Figure 2.1: Regular expression matching as a big-step semantics
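The big-step relation can be read almost directly as a recursive function. The following is only a minimal sketch under assumptions of our own: regular expressions are encoded as nested Python tuples (the tags "eps", "chr", "star", "seq" and "alt" are hypothetical names, not notation from the paper), and the function name `matches` is likewise illustrative. In the star case the first piece of the split is required to be non-empty, which does not change the set of matched strings but makes the recursion terminate.

```python
def matches(e, w):
    """Return True if the string w matches the expression e (big-step reading)."""
    tag = e[0]
    if tag == "eps":                       # the empty expression matches only ""
        return w == ""
    if tag == "chr":                       # a single character matches itself
        return w == e[1]
    if tag == "alt":                       # either branch may succeed
        return matches(e[1], w) or matches(e[2], w)
    if tag == "seq":                       # try every split w = w1 w2
        return any(matches(e[1], w[:i]) and matches(e[2], w[i:])
                   for i in range(len(w) + 1))
    if tag == "star":
        if w == "":                        # zero iterations
            return True
        # one iteration on a non-empty prefix, then the star again on the rest
        return any(matches(e[1], w[:i]) and matches(e, w[i:])
                   for i in range(1, len(w) + 1))
    raise ValueError("unknown expression tag: %r" % (tag,))


# The running example a**b of the later sections.
A_STAR_STAR_B = ("seq", ("star", ("star", ("chr", "a"))), ("chr", "b"))
print(matches(A_STAR_STAR_B, "aab"))
print(matches(A_STAR_STAR_B, "aba"))
```

Trying all splits is exponential in general; the sketch is meant to mirror the rules, not to be an efficient matcher.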

Some of our operational semantics will use lists. We write h :: t for constructing a list with head h 
and tail t. The concatenation of two lists s and t is written as s@t. For example, 1 :: [2] = [1,2] and 
[1,2]@[3] = [1,2,3]. The empty list is written as []. 



3 The EKW machine 

The big-step operational semantics of matching in Figure 2.1 gives us little information about how we
should attempt to match a given input string w. We define a small-step semantics, called the EKW
machine, that makes the matching process more explicit. In the tradition of the SECD machine [7], the
machine is named after its components: E for expression, K for continuation, W for word to be matched.

Definition 3.1 A configuration of the EKW machine is of the form $\langle e ; k ; w \rangle$ where e is a regular
expression, k is a list of regular expressions, and w is a string. The transitions of the EKW machine are
given in Figure 3.1. The accepting configuration is $\langle \varepsilon ; [\,] ; \varepsilon \rangle$.

Here e is the regular expression the machine is currently focusing on. What remains to the right of the
current expression is represented by k, the current continuation. The combination of e and k together is
attempting to match w, the current input string.

Note that many of the rules are fairly standard, specifically the pushing and popping of the
continuation stack. The machine is non-deterministic. The paired rules with the same current expressions $e^*$
or $(e_1 \mid e_2)$ give rise to branching in order to search for matches, where it is sufficient that one of the
branches succeeds.

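Theorem 3.2 (Partial correctness) $e \downarrow w$ if and only if there is a run

\[
\langle e ; [\,] ; w \rangle \to \cdots \to \langle \varepsilon ; [\,] ; \varepsilon \rangle
\]

\[
\begin{array}{rcll}
\langle e ; k ; w \rangle & \to & \langle e' ; k' ; w' \rangle \\[4pt]
\langle e_1 \mid e_2 ; k ; w \rangle & \to & \langle e_1 ; k ; w \rangle & (3.1) \\
\langle e_1 \mid e_2 ; k ; w \rangle & \to & \langle e_2 ; k ; w \rangle & (3.2) \\
\langle e_1\, e_2 ; k ; w \rangle & \to & \langle e_1 ; e_2 :: k ; w \rangle & (3.3) \\
\langle e^* ; k ; w \rangle & \to & \langle e ; e^* :: k ; w \rangle & (3.4) \\
\langle e^* ; k ; w \rangle & \to & \langle \varepsilon ; k ; w \rangle & (3.5) \\
\langle a ; k ; a\,w \rangle & \to & \langle \varepsilon ; k ; w \rangle & (3.6) \\
\langle \varepsilon ; e :: k ; w \rangle & \to & \langle e ; k ; w \rangle & (3.7)
\end{array}
\]

Figure 3.1: EKW machine transition steps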
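To make the EKW transitions concrete, here is a small sketch that enumerates the successor configurations of a given configuration and then searches over all runs. The tuple encoding of expressions and the names `ekw_steps` and `ekw_accepts` are assumptions made for illustration. The search prunes configurations it has already seen using structural equality, which is a simplification of this sketch and not the pointer-based redundancy elimination developed later in the paper.

```python
from collections import deque

EPS = ("eps",)

def ekw_steps(e, k, w):
    """All successors of the configuration <e ; k ; w>; k is a tuple used as a stack."""
    tag = e[0]
    if tag == "alt":                                   # rules (3.1) and (3.2)
        return [(e[1], k, w), (e[2], k, w)]
    if tag == "seq":                                   # rule (3.3): push e2
        return [(e[1], (e[2],) + k, w)]
    if tag == "star":                                  # rules (3.4) and (3.5)
        return [(e[1], (e,) + k, w), (EPS, k, w)]
    if tag == "chr":                                   # rule (3.6): consume a character
        return [(EPS, k, w[1:])] if w[:1] == e[1] else []
    if tag == "eps":                                   # rule (3.7): pop the continuation
        return [(k[0], k[1:], w)] if k else []
    raise ValueError("unknown expression tag: %r" % (tag,))

def ekw_accepts(e, w):
    """Breadth-first search for a run that reaches <eps ; [] ; empty string>."""
    start = (e, (), w)
    seen, todo = {start}, deque([start])
    while todo:
        conf = todo.popleft()
        if conf == (EPS, (), ""):
            return True
        for nxt in ekw_steps(*conf):
            if nxt not in seen:                        # structural pruning (sketch only)
                seen.add(nxt)
                todo.append(nxt)
    return False

A = ("chr", "a")
A_STAR_STAR = ("star", ("star", A))
print(ekw_accepts(("seq", A_STAR_STAR, ("chr", "b")), "aab"))
```

Because `ekw_steps` returns every possible successor, a single run corresponds to picking one element of the returned list at each step.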

Example 3.3 Unfortunately, while Theorem 3.2 ensures that every matching string is accepted by some run,
there is no guarantee that every run of the machine on such a string reaches the accepting configuration. In fact, there are
valid inputs on which the machine may enter an infinite loop; an example is the configuration $\langle a^{**} ; [\,] ; a \rangle$:

\[
\begin{array}{rcl}
\langle a^{**} ; [\,] ; a \rangle & \to & \langle a^{*} ; [a^{**}] ; a \rangle \\
 & \to & \langle \varepsilon ; [a^{**}] ; a \rangle \\
 & \to & \langle a^{**} ; [\,] ; a \rangle
\end{array}
\]



Such infinite loops can be prevented by backtracking and pruning. However, backtracking implementations
can still take a very long time matching expressions like $a^{**}$ to a string consisting of, say, 1000
occurrences of a character a followed by some other b, due to the exponentially increasing search space [4].

In Thompson's matcher, such loops are avoided by means of redundancy elimination. The matcher
checks whether it has encountered the same expression before. Note, however, that "the same" expression
is to be taken in the sense of pointer equality rather than structural equality. For instance, the two
occurrences of a in $(ab) \mid (ac)$ would be taken as not the same, given their different positions in the
syntax tree.



4 The PWπ machine

We refine the EKW machine by representing the regular expression as a data structure in a heap $\pi$, which
serves as the program run by the machine. That way, the machine can distinguish between different
positions in the syntax tree.

Definition 4.1 A heap $\pi$ is a finite partial function from addresses to values. There exists a distinguished
address null, which is not mapped to any value.

In our setting, the values are syntax tree nodes, represented by an operator from the syntax of regular
expressions together with pointers to the trees for the arguments (if any) of the operator. For example, for
sequential composition, we have a node containing $(p_1 \bullet p_2)$, where the two pointers $p_1$ and $p_2$ point to
the trees of the two expressions being composed.

Figure 4.1: The regular expression $a^{**} \bullet b$ as a tree with continuation pointers

Definition 4.2 We write $\otimes$ for the partial operation of forming the union of two partial functions,
provided that their domains are disjoint. More formally, let $f_1 : A \rightharpoonup B$ and $f_2 : A \rightharpoonup B$ be two partial
functions. Then if $\mathrm{dom}(f_1) \cap \mathrm{dom}(f_2) = \emptyset$, the function

\[
(f_1 \otimes f_2) : A \rightharpoonup B
\]

is defined as $f_1 \otimes f_2 = f_1 \cup f_2$.

Note that $\otimes$ is the same as the operation * on heaps in separation logic [11], and hence a partial
commutative monoid. We avoid the notation * as it could be confused with the Kleene star. As in
separation logic, we use $\otimes$ to describe data structures with pointers in memory.
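As a small illustration of the disjoint union, heaps can be modelled as Python dictionaries from addresses to nodes. This is only a sketch: the function name `disjoint_union`, the integer addresses and the tuple encoding of nodes are assumptions of the example, and an undefined result is represented here by `None`.

```python
def disjoint_union(f1, f2):
    """Union of two finite partial functions (dicts); None if their domains overlap."""
    if f1.keys() & f2.keys():
        return None                     # the operation is undefined in this case
    return {**f1, **f2}

# Three one-cell heaps that together lay out the tree for the expression "a b".
pi0 = {0: ("seq", 1, 2)}
pi1 = {1: ("chr", "a")}
pi2 = {2: ("chr", "b")}

heap = disjoint_union(disjoint_union(pi0, pi1), pi2)
print(heap)
print(disjoint_union(pi0, pi0))         # overlapping domains: undefined
```

Because the pieces must be disjoint, a heap that splits this way cannot share a node between two subtrees.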

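Definition 4.3 We write $\pi, p \models e$ if p points to the root node of a regular expression e in a heap $\pi$. The
relation is defined by induction on e as follows:

\[
\begin{array}{ll}
\pi, p \models a & \text{if } \pi(p) = a \\
\pi, p \models \varepsilon & \text{if } \pi(p) = \varepsilon \\
\pi, p \models (e_1 \mid e_2) & \text{if } \pi = \pi_0 \otimes \pi_1 \otimes \pi_2 \wedge \pi_0(p) = (p_1 \mid p_2) \wedge \pi_1, p_1 \models e_1 \wedge \pi_2, p_2 \models e_2 \\
\pi, p \models (e_1\, e_2) & \text{if } \pi = \pi_0 \otimes \pi_1 \otimes \pi_2 \wedge \pi_0(p) = (p_1 \bullet p_2) \wedge \pi_1, p_1 \models e_1 \wedge \pi_2, p_2 \models e_2 \\
\pi, p \models e_1^* & \text{if } \pi = \pi_0 \otimes \pi_1 \wedge \pi_0(p) = p_1^* \wedge \pi_1, p_1 \models e_1
\end{array}
\]

Here the definition of $\pi, p \models e$ precludes any cycles in the child pointer chain.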

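As an example, consider the regular expression $e = a^{**}b$. A heap $\pi$ and a pointer $p_0$ such that $\pi, p_0 \models e$ are given by
the table in Figure 4.1. The tree structure, represented by the solid arrows, is drawn on the right.

\[
p \to q \quad \text{or} \quad p \xrightarrow{a} q \quad \text{relative to } \pi
\]
\[
\begin{array}{rcll}
p & \to & p_1 & \text{if } \pi(p) = p_1 \mid p_2 \\
p & \to & p_2 & \text{if } \pi(p) = p_1 \mid p_2 \\
p & \to & p_1 & \text{if } \pi(p) = p_1 \bullet p_2 \\
p & \to & p_1 & \text{if } \pi(p) = p_1^* \\
p & \to & p_2 & \text{if } \pi(p) = p_1^* \text{ and } \mathit{cont}\ p = p_2 \\
p & \to & p_1 & \text{if } \pi(p) = \varepsilon \text{ and } \mathit{cont}\ p = p_1 \\
p & \xrightarrow{a} & p' & \text{if } \pi(p) = a \text{ and } p' = \mathit{cont}\ p
\end{array}
\]

Figure 4.2: PWπ transitions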



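Definition 4.4 Let cont be a function

\[
\mathit{cont} : \mathrm{dom}(\pi) \to (\mathrm{dom}(\pi) \cup \{\mathrm{null}\})
\]

We write $\pi \models \mathit{cont}$ if

• If $\pi(p) = (p_1 \mid p_2)$, then $\mathit{cont}\ p_1 = \mathit{cont}\ p$ and $\mathit{cont}\ p_2 = \mathit{cont}\ p$

• If $\pi(p) = (p_1 \bullet p_2)$, then $\mathit{cont}\ p_1 = p_2$ and $\mathit{cont}\ p_2 = \mathit{cont}\ p$

• If $\pi(p) = (p_1)^*$, then $\mathit{cont}\ p_1 = p$

• $\mathit{cont}\ p_0 = \mathrm{null}$, where $p_0$ is the pointer to the root of the syntax tree.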

The function cont is uniquely determined by the tree structure laid out in $\pi$, and it is easy to
compute by a recursive tree walk. We elide it when it is clear from the context, assuming that $\pi$ always
comes equipped with a cont such that $\pi \models \mathit{cont}$. By treating cont as a function, we have not committed
to a particular implementation; for instance, cont could be represented as a hash table indexed by pointer
values, or it could be added as another pointer field to the nodes in the heap.

In the graphical representation in Figure 4.1, dashed arrows represent cont. In particular, note the
cycle leading downward from $p_1$ and up again via dashed arrows. Following such a cycle could lead to
infinite loops, as for the EKW machine in Example 3.3.
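The recursive tree walk that computes cont can be sketched as follows. The heap layout for a**b used here (addresses 0 to 4 standing for p0 to p4, and `None` standing for null) is an assumption read off from the examples in the text, and cont is represented as a dictionary, which is one of the implementation options mentioned above rather than a fixed choice.

```python
# Assumed heap for a**b: p0 = p1 . p2, p1 = p3*, p2 = b, p3 = p4*, p4 = a.
HEAP = {
    0: ("seq", 1, 2),
    1: ("star", 3),
    2: ("chr", "b"),
    3: ("star", 4),
    4: ("chr", "a"),
}

def compute_cont(heap, root):
    """Return cont as a dict, following the four clauses of Definition 4.4."""
    cont = {}

    def walk(p, k):
        cont[p] = k                      # k is the continuation of the node at p
        node = heap[p]
        if node[0] == "alt":             # both branches continue with k
            walk(node[1], k)
            walk(node[2], k)
        elif node[0] == "seq":           # the left child continues with the right child
            walk(node[1], node[2])
            walk(node[2], k)
        elif node[0] == "star":          # the body of a star continues with the star itself
            walk(node[1], p)

    walk(root, None)                     # the root continues with null
    return cont

print(compute_cont(HEAP, 0))
```

The walk visits each node once, so its cost is linear in the size of the syntax tree.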



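Definition 4.5 The PWπ machine is defined as follows. Transitions of this machine are always relative
to some heap $\pi$, which does not change during evaluation. We elide $\pi$ if it is clear from the context.
Configurations of the machine are of the form $\langle p ; w \rangle$, where p is a pointer in $\pi$ and w is a string of
input symbols. Given the transition relation between pointers defined in Figure 4.2, the machine has the
following transitions:

\[
\frac{p \xrightarrow{a} q}{\langle p ; a\,w \rangle \to \langle q ; w \rangle}
\qquad
\frac{p \to q}{\langle p ; w \rangle \to \langle q ; w \rangle}
\]

The accepting state of the machine is $\langle \mathrm{null} ; \varepsilon \rangle$. That is, both the continuation and the remaining input
have been consumed.

Example 4.6 For a regular expression $e = a^{**}b$, let $\pi$ and $p_0$ be such that $\pi, p_0 \models e$. See Figure 4.1 for
the representation of $\pi$ as a tree with pointers. The runs below illustrate two possible executions of
the PWπ machine against inputs e and aab.

Execution 1: Infinite loop

\[
\begin{array}{rl}
 & \langle p_0 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \langle p_3 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \langle p_3 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \langle p_3 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \langle p_3 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \cdots
\end{array}
\]

Execution 2: Successful match

\[
\begin{array}{rl}
 & \langle p_0 ; aab \rangle \\
\to & \langle p_1 ; aab \rangle \\
\to & \langle p_3 ; aab \rangle \\
\to & \langle p_4 ; aab \rangle \\
\to & \langle p_3 ; ab \rangle \\
\to & \langle p_4 ; ab \rangle \\
\to & \langle p_3 ; b \rangle \\
\to & \langle p_1 ; b \rangle \\
\to & \langle p_2 ; b \rangle \\
\to & \langle \mathrm{null} ; \varepsilon \rangle
\end{array}
\]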



Theorem 4.7 (Simulation) Let $\pi$ be a heap such that $\pi, p \models e$. Then there is a run of the EKW machine
of the form

\[
\langle e ; [\,] ; w \rangle \to \cdots \to \langle \varepsilon ; [\,] ; \varepsilon \rangle
\]

if and only if there is a run of the PWπ machine of the form

\[
\langle p ; w \rangle \to \cdots \to \langle \mathrm{null} ; \varepsilon \rangle
\]

One needs to show that each step of the EKW machine can be simulated by the PWπ machine and vice
versa. The invariant in this simulation is that the stack k in the EKW machine can be reconstructed by
following the chain of pointers in the heap of the PWπ machine via the following function:

\[
\begin{array}{ll}
\mathit{stack}\ p = [\,] & \text{if } \mathit{cont}\ p = \mathrm{null} \\
\mathit{stack}\ p = e :: (\mathit{stack}\ q) & \text{if } q = \mathit{cont}\ p \neq \mathrm{null} \text{ and } \pi, q \models e
\end{array}
\]
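The PWπ transitions can be sketched in a few lines, which also lets the two executions for a**b on input aab be replayed. The heap and cont tables below are assumptions read off from the examples (integer addresses 0 to 4 for p0 to p4, `None` for null), and the helper names `successors` and `replay` are hypothetical. A run is selected by giving, for each step, the index of the successor to take.

```python
HEAP = {0: ("seq", 1, 2), 1: ("star", 3), 2: ("chr", "b"),
        3: ("star", 4), 4: ("chr", "a")}
CONT = {0: None, 1: 2, 2: None, 3: 1, 4: 3}

def successors(p, w):
    """All configurations reachable from <p ; w> in one step of the PW-pi machine."""
    if p is None:                        # null has no outgoing transitions
        return []
    node = HEAP[p]
    kind = node[0]
    if kind == "alt":
        targets = [node[1], node[2]]
    elif kind == "seq":
        targets = [node[1]]
    elif kind == "star":                 # enter the body, or leave via the continuation
        targets = [node[1], CONT[p]]
    elif kind == "eps":
        targets = [CONT[p]]
    else:                                # character node: must consume matching input
        return [(CONT[p], w[1:])] if w[:1] == node[1] else []
    return [(q, w) for q in targets]

def replay(p, w, choices):
    """Follow one run; choices[i] selects which successor to take at step i."""
    trace = [(p, w)]
    for i in choices:
        p, w = successors(p, w)[i]
        trace.append((p, w))
    return trace

# A successful run on aab, ending in <null ; empty string>.
print(replay(0, "aab", [0, 0, 0, 0, 0, 0, 1, 1, 0])[-1])
# A prefix of the looping run: p0, p1, p3, p1, p3, p1 with no input consumed.
print(replay(0, "aab", [0, 0, 1, 0, 1]))
```

The looping run never consumes input, which is why a matcher needs some form of redundancy elimination to terminate.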

5 The lockstep construction in general 

As we have seen, the PWπ machine is built from two kinds of steps. Pointers can be evolved via $p \to q$
by moving in the syntax tree without reading any input. When a node for a constant is reached, it can be
matched to the first character in the input via a step $p \xrightarrow{a} q$.



Definition 5.1 Let $S \subseteq \mathrm{dom}(\pi) \cup \{\mathrm{null}\}$ be a set of pointers. We define the evolution $\Box S$ of S as the
following set:

\[
\Box S = \{ q \in \mathrm{dom}(\pi) \mid \exists p \in S.\ p \to^* q \wedge \exists a.\ \pi(q) = a \}
\]

Forming $\Box S$ is similar to computing the $\varepsilon$-closure in automata theory. However, this operation is not
a closure operator, because $S \subseteq \Box S$ does not hold in general. When one computes $\Box S$ incrementally,
elements are removed as well as added. Avoiding infinite loops by adding and removing the same element
is the main difficulty in the computation.

We define a transition relation analogous to Definition 4.5, but as a deterministic relation on sets of
pointers. We refer to these as macro steps, as they assume the computation of $\Box S$ as given in a single
step, whereas an implementation needs to compute it incrementally.

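Definition 5.2 (Lockstep transitions) Let $S, S' \subseteq \mathrm{dom}(\pi) \cup \{\mathrm{null}\}$ be sets of pointers.

\[
\begin{array}{ll}
S \Rightarrow S' & \text{if } S' = \Box S \\
S \stackrel{a}{\Rightarrow} S' & \text{if } S' = \{ q \in \mathrm{dom}(\pi) \mid \exists p \in S.\ p \xrightarrow{a} q \}
\end{array}
\]

A set of pointers is first evolved from S to $\Box S$. Then, moving from a set of pointers $\Box S$ to $S'$ via $\Box S \stackrel{a}{\Rightarrow} S'$
advances the state of the machine by advancing all pointers that can match a to their continuations. All
other pointers are deleted as unsuccessful matches.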

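Definition 5.3 (Generic lockstep machine) The generic lockstep machine has configurations of the form
$\langle S ; w \rangle$. Transitions are defined using Definition 5.2:

\[
\frac{S \stackrel{a}{\Rightarrow} S'}{\langle S ; a\,w \rangle \Rightarrow \langle S' ; w \rangle}
\qquad
\frac{S \Rightarrow S'}{\langle S ; w \rangle \Rightarrow \langle S' ; w \rangle}
\]

Accepting states of the machine are of the form $\langle S ; \varepsilon \rangle$, where $\mathrm{null} \in S$.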
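The macro steps can be sketched with Python sets. The heap and cont tables for a**b are assumptions read off from the examples, and the names `evolve` and `lockstep_match` are hypothetical. One further assumption is made explicit: this sketch keeps null (`None`) in the evolved set whenever it is reachable, so that the acceptance condition can be observed even for expressions that match without consuming input. The incremental computation uses a visited set, which is what prevents the looping seen in the PWπ machine.

```python
HEAP = {0: ("seq", 1, 2), 1: ("star", 3), 2: ("chr", "b"),
        3: ("star", 4), 4: ("chr", "a")}
CONT = {0: None, 1: 2, 2: None, 3: 1, 4: 3}

def eps_targets(p):
    """Pointers reachable from p in one step that consumes no input."""
    node = HEAP[p]
    if node[0] == "alt":
        return [node[1], node[2]]
    if node[0] == "seq":
        return [node[1]]
    if node[0] == "star":
        return [node[1], CONT[p]]
    if node[0] == "eps":
        return [CONT[p]]
    return []                             # character nodes have no such step

def evolve(S):
    """Evolution of S: reachable character nodes (and null, by assumption)."""
    seen, todo, out = set(), list(S), set()
    while todo:
        p = todo.pop()
        if p in seen:                     # each pointer is expanded at most once
            continue
        seen.add(p)
        if p is None or HEAP[p][0] == "chr":
            out.add(p)
        else:
            todo.extend(eps_targets(p))
    return out

def lockstep_match(w, root=0):
    S = {root}
    for a in w:
        S = evolve(S)                                        # first kind of macro step
        S = {CONT[p] for p in S
             if p is not None and HEAP[p] == ("chr", a)}     # macro step on input a
    return None in evolve(S)

print(lockstep_match("aab"), lockstep_match("aba"))
```

Each input character triggers one evolution and one matching step, so the running time is bounded by the input length times the size of the heap.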

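Theorem 5.4 For a heap $\pi, p \models e$ there is a run of the PWπ machine

\[
\langle p ; w \rangle \to \cdots \to \langle \mathrm{null} ; \varepsilon \rangle
\]

if and only if there is a run of the lockstep machine

\[
\langle \{p\} ; w \rangle \Rightarrow \cdots \Rightarrow \langle S ; \varepsilon \rangle
\]

for some set of pointers S with $\mathrm{null} \in S$.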

6 The sequential lockstep machine 

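The sequential lockstep machine maintains two lists of pointers c and n, corresponding to pointers being
incrementally evolved within the current macro step and pointers to be evolved in the next macro step.
Another pointer list t is maintained to support redundancy elimination. We also introduce
an auxiliary function $\psi(p, l_1, l_2)$ to aid in this regard:

Definition 6.1 The auxiliary function $\psi(p, l_1, l_2)$ is defined as:

\[
\begin{array}{ll}
\psi(p, l_1, l_2) = p :: l_1 & \text{if } p \notin l_1 @ l_2 \\
\psi(p, l_1, l_2) = l_1 & \text{if } p \in l_1 @ l_2
\end{array}
\]

\[
\begin{array}{rcll}
\langle c ; t ; n ; w \rangle & \to & \langle c' ; t' ; n' ; w' \rangle \\[4pt]
\langle p :: c ; t ; n ; w \rangle & \to & \langle c' ; p :: t ; n ; w \rangle & \text{if } \pi(p) = p' \mid p'', \text{ where } c' = \psi(p', \psi(p'', c, t), t) \\
\langle p :: c ; t ; n ; w \rangle & \to & \langle c' ; p :: t ; n ; w \rangle & \text{if } \pi(p) = p' \bullet p'', \text{ where } c' = \psi(p', c, t) \\
\langle p :: c ; t ; n ; w \rangle & \to & \langle c' ; p :: t ; n ; w \rangle & \text{if } \pi(p) = (p')^*, \text{ where } c' = \psi(\mathit{cont}\ p, \psi(p', c, t), t) \\
\langle p :: c ; t ; n ; w \rangle & \to & \langle c' ; p :: t ; n ; w \rangle & \text{if } \pi(p) = \varepsilon, \text{ where } c' = \psi(\mathit{cont}\ p, c, t) \\
\langle p :: c ; t ; n ; a\,w \rangle & \to & \langle c ; t ; n ; a\,w \rangle & \text{if } p = \mathrm{null} \\
\langle p :: c ; t ; n ; a\,w \rangle & \to & \langle c ; p :: t ; n' ; a\,w \rangle & \text{if } \pi(p) = a, \text{ where } n' = \psi(\mathit{cont}\ p, n, [\,]) \\
\langle p :: c ; t ; n ; a\,w \rangle & \to & \langle c ; p :: t ; n ; a\,w \rangle & \text{if } \pi(p) = b, \text{ where } b \neq a \\
\langle [\,] ; t ; n ; a\,w \rangle & \to & \langle n ; [\,] ; [\,] ; w \rangle & \text{if } n \neq [\,] \\
\langle p :: c ; t ; n ; \varepsilon \rangle & \to & \langle c ; p :: t ; n ; \varepsilon \rangle & \text{if } \pi(p) = a
\end{array}
\]

Figure 6.1: Sequential lockstep machine with redundancy elimination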



Definition 6.2 The redundancy-eliminating sequential lockstep machine has configurations of the form
$\langle c ; t ; n ; w \rangle$. Its transitions are given in Figure 6.1. The accepting states are of the form
$\langle \mathrm{null} :: c' ; t' ; n' ; \varepsilon \rangle$.



We regard this machine as a rational reconstruction of Thompson's matcher [13] in the light of Cox's
elucidation as a virtual machine [5]. This machine uses a sequential schedule for incrementally evolving
pointers, keeping a list of pointers that have been evolved already to prevent loops and search space
explosion. However, our main interest is in performing this computation in parallel.
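The sequential machine with its three lists can be sketched as a loop over configurations. The heap and cont tables for a**b are assumptions read off from the examples (`None` stands for null), the function names are hypothetical, and the handling of a non-matching character node follows the reading that the node is simply moved to t. This is an illustration of the rules, not Thompson's original code.

```python
HEAP = {0: ("seq", 1, 2), 1: ("star", 3), 2: ("chr", "b"),
        3: ("star", 4), 4: ("chr", "a")}
CONT = {0: None, 1: 2, 2: None, 3: 1, 4: 3}

def psi(p, l1, l2):
    """Add p to the front of l1 unless it already occurs in l1 or l2."""
    return [p] + l1 if p not in l1 + l2 else l1

def seq_lockstep(w, root=0):
    c, t, n = [root], [], []
    while True:
        if c and c[0] is None and w == "":
            return True                       # accepting state: null at the head, no input left
        if not c:
            if w and n:                       # macro step: consume one input character
                c, t, n, w = n, [], [], w[1:]
                continue
            return False                      # stuck without accepting
        p, c = c[0], c[1:]
        if p is None:                         # null while input remains: drop it
            continue
        node = HEAP[p]
        kind = node[0]
        if kind == "alt":
            c = psi(node[1], psi(node[2], c, t), t)
        elif kind == "seq":
            c = psi(node[1], c, t)
        elif kind == "star":
            c = psi(CONT[p], psi(node[1], c, t), t)
        elif kind == "eps":
            c = psi(CONT[p], c, t)
        elif w and node[1] == w[0]:           # matching character: schedule its continuation
            n = psi(CONT[p], n, [])
        t = [p] + t                           # p has now been processed in this macro step

print(seq_lockstep("aab"), seq_lockstep("aba"))
```

Since every processed pointer is recorded in t, no node is expanded twice within a macro step, which bounds the work per input character by the size of the heap.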



7 Parallel lockstep semantics 

We now define an operational semantics where each pointer is given a dedicated thread for evolving
it. Our motivation is to leverage the large number of cores and hence threads available on GPUs. The
semantics in this section is intended as an idealization of the implementation described in Section 8
below, capturing the essentials of the computation while abstracting from implementation details.

To describe the parallel computation, we define a simple process calculus. Its transition rules are
given in Figure 7.1. Most of our calculus is a subset of CCS [8], with one-to-one directional message
passing and parallel composition. However, we also need an n-way synchronization with a synchronous
transition inspired by Synchronous CCS [9].

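We let M range over processes, p over pointers that may be sent as asynchronous messages, and a
over input symbols, which may be used for n-way synchronisation.

\[
M \to M' \qquad\qquad M \xrightarrow{a} M'
\]
\[
\frac{M_1 \to M_2}{(M_1 \parallel M) \to (M_2 \parallel M)}\ (\textsc{Par})
\qquad
\frac{}{(p.M \parallel \overline{p}) \to M}\ (\textsc{Send})
\]
\[
\frac{M' \not\equiv (\$a.M'') \parallel M''' \qquad M' \not\to}{(\$a.M_1 \parallel \cdots \parallel \$a.M_n \parallel M') \xrightarrow{a} (M_1 \parallel \cdots \parallel M_n)}\ (\textsc{Sync})
\]

Figure 7.1: Process calculus

The syntax of processes is as follows: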

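\[
\begin{array}{rcl}
M & ::= & \overline{p} \\
  & \mid & M \parallel M \\
  & \mid & p.M \\
  & \mid & \$a.M
\end{array}
\]

We impose some structural congruences $\equiv$, identifying terms up to associativity and commutativity
of parallel composition $\parallel$. Process transitions can be interleaved with rule Par.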

We have CCS-style handshake communication in rule Send. Here $p.M$ receives the message $\overline{p}$
and proceeds with M afterwards. Note that receivers of the form $p.M$ are not replicated (in the
pi-calculus sense [10]), so that each communication consumes the receiver. This behaviour is essential, as
the processes we generate could become trapped in an infinite loop otherwise.

We also have an n-way synchronisation Sync. This rule is the most complex, and it is needed to
implement matching to input once all pointers have been evolved. The idea is as follows:

• The current process is factorized into those processes that are of the form $\$a.M_j$ and an $M'$
comprising everything else.

• There are no further $\to$ transitions inside $M'$, written as $M' \not\to$.

• If these conditions are met, then all the processes waiting to participate in an n-way
synchronization on a are advanced in one synchronous step.

• The remaining processes in $M'$ are discarded in the same step.

Rules in this style, in which a number of processes are advanced in a single step, are sometimes
referred to as "lockstep" [9]. Indeed, we use this rule to implement the lockstep matching of regular
expressions in the sense of Thompson and Cox. (In practice, this rule may require a little ad-hoc protocol
to implement on a given architecture.)

We translate each expression pointer p in the heap $\pi$ into a process $[\![p]\!]_\pi$ as follows:

\[
\begin{array}{rcll}
[\![p]\!]_\pi & = & p.(\overline{q_1} \parallel \overline{q_2}) & \text{if } \pi(p) = (q_1 \mid q_2) \\
[\![p]\!]_\pi & = & p.\overline{q_1} & \text{if } \pi(p) = (q_1 \bullet q_2) \\
[\![p]\!]_\pi & = & p.(\overline{q_1} \parallel \overline{q_2}) & \text{if } \pi(p) = q_1^* \text{ and } \mathit{cont}\ p = q_2 \\
[\![p]\!]_\pi & = & p.\overline{q} & \text{if } \pi(p) = \varepsilon \text{ and } \mathit{cont}\ p = q \\
[\![p]\!]_\pi & = & p.\$a.\overline{q} & \text{if } \pi(p) = a \text{ and } \mathit{cont}\ p = q
\end{array}
\]

Intuitively, for each internal node in the expression tree identified by the pointer p, we create a
dedicated little process that listens on a channel uniquely corresponding to p. For simplicity, we use the
same name for the channel as for the pointer. The process may be activated by messages $\overline{p}$ sent to it, and
it may send such messages itself. These messages trigger a chain reaction that evolves the current pointer
set of a macro step. There is no need for these messages to be externally visible, as their only purpose
is to wake up their unique recipient. A process $p.M$ listening for p is consumed by the transition that
receives the message. Processes for nodes that point to input characters a at the leaves of the expression
tree use a different form of communication. All these nodes synchronize on the input symbol. The
symbol a is visible in the resulting synchronous transition step $\xrightarrow{a}$, because we need it to agree with the
next input symbol.

If $\mathrm{dom}(\pi) = \{p_1, \ldots, p_n\}$, we define the translation $[\![\pi]\!]$ as the translation of all its pointers:

\[
[\![\pi]\!] = [\![p_1]\!]_\pi \parallel \cdots \parallel [\![p_n]\!]_\pi
\]

If the input string is not empty, let a be the first character, so that $a\,w' = w$. The parallel machine
launches processes for all the nodes in the tree, and sends a message to the process for the root. The
resulting process makes a number of asynchronous transitions, followed by a synchronous move for a:

\[
[\![\pi]\!] \parallel \overline{p_0} \to \cdots \to\ \xrightarrow{a} M'
\]

All these steps together represent one macro step. The machine then repeats the above with the next
symbol $a'$ and $M'$:

\[
[\![\pi]\!] \parallel M' \to \cdots \to\ \xrightarrow{a'} M''
\]

The machine accepts if the remaining input is empty and the current process is of the form

\[
\overline{\mathrm{null}} \parallel M
\]

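Example 7.1 For $e = a^{**}b$, let $\pi$ and $p_0$ be such that $\pi, p_0 \models e$. See Figure 4.1 for the representation of
$\pi$ as a tree with pointers. Translating the tree structure to parallel processes gives us:

\[
[\![\pi]\!] = p_0.\overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3}
\]

Assume an input string of aab. We have the pointer evolution as follows:

\[
\begin{array}{rl}
 & \overline{p_0} \parallel [\![\pi]\!] \\
= & \overline{p_0} \parallel p_0.\overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3} \\
\to & \overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3} \\
\to & \overline{p_3} \parallel \overline{p_2} \parallel p_2.\$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3} \\
\to & \overline{p_3} \parallel \$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3} \\
\to & \$b.\overline{\mathrm{null}} \parallel \overline{p_4} \parallel \overline{p_1} \parallel p_4.\$a.\overline{p_3} \\
\to & \$b.\overline{\mathrm{null}} \parallel \overline{p_1} \parallel \$a.\overline{p_3}
\end{array}
\]

Since no more micro transitions are possible, we have reached the n-way synchronization point:

\[
\$b.\overline{\mathrm{null}} \parallel \overline{p_1} \parallel \$a.\overline{p_3} \xrightarrow{a} \overline{p_3}
\]

Now we feed the residual messages back into a fresh $[\![\pi]\!]$:

\[
\begin{array}{rl}
 & \overline{p_3} \parallel [\![\pi]\!] \\
= & \overline{p_3} \parallel p_0.\overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel p_3.(\overline{p_4} \parallel \overline{p_1}) \parallel p_4.\$a.\overline{p_3} \\
\to & p_0.\overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel \overline{p_4} \parallel \overline{p_1} \parallel p_4.\$a.\overline{p_3} \\
\to & p_0.\overline{p_1} \parallel p_1.(\overline{p_3} \parallel \overline{p_2}) \parallel p_2.\$b.\overline{\mathrm{null}} \parallel \overline{p_1} \parallel \$a.\overline{p_3} \\
\to & p_0.\overline{p_1} \parallel \overline{p_3} \parallel \overline{p_2} \parallel p_2.\$b.\overline{\mathrm{null}} \parallel \$a.\overline{p_3} \\
\to & p_0.\overline{p_1} \parallel \overline{p_3} \parallel \$b.\overline{\mathrm{null}} \parallel \$a.\overline{p_3} \\
\xrightarrow{a} & \overline{p_3} \\
\vdots & \\
\xrightarrow{b} & \overline{\mathrm{null}}
\end{array}
\]

Therefore, we have received a null while the input string has become empty, resulting in a successful
match.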

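We need to prove that the construction above can correctly evolve and match any set of pointers. Let
$S = \{p_1, \ldots, p_n\} \subseteq \mathrm{dom}(\pi) \cup \{\mathrm{null}\}$ be a set of pointers in the heap. We define

\[
\overline{S} = \overline{p_1} \parallel \cdots \parallel \overline{p_n}
\]

to represent this set as a parallel composition of messages.

Theorem 7.2 Let $S, S' \subseteq \mathrm{dom}(\pi) \cup \{\mathrm{null}\}$. We have

\[
S \Rightarrow \Box S \stackrel{a}{\Rightarrow} S'
\]

if and only if

\[
\overline{S} \parallel [\![\pi]\!] \to \cdots \to\ \xrightarrow{a} \overline{S'}
\]

Moreover, each $\to$ transition sequence starting from $\overline{S} \parallel [\![\pi]\!]$ is finite.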



Theorem 7.2 assures us that the parallel operational semantics correctly implements the lockstep
construction. The pointers p in the tree, represented as processes $[\![p]\!]_\pi$, are evolved in parallel. Although
this evolution is non-deterministic, its end result is determinate. Moreover, the cycles in the pointer chain
do not lead to cyclic processes looping forever, since each receiving process becomes inactive once the
node has been visited.
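The claim that the non-deterministic evolution has a determinate result can be explored with a small simulation of the process semantics. This is a sketch under assumptions: the heap and cont tables for a**b are read off from the examples, receivers are modelled as a set from which an element is removed when it receives a message, and a seeded random number generator stands in for an arbitrary scheduler. The names `outputs`, `evolve` and `parallel_match` are hypothetical.

```python
import random

HEAP = {0: ("seq", 1, 2), 1: ("star", 3), 2: ("chr", "b"),
        3: ("star", 4), 4: ("chr", "a")}
CONT = {0: None, 1: 2, 2: None, 3: 1, 4: 3}

def outputs(p):
    """Messages sent by the process for an internal node p once it is woken up."""
    node = HEAP[p]
    if node[0] == "alt":
        return [node[1], node[2]]
    if node[0] == "seq":
        return [node[1]]
    if node[0] == "star":
        return [node[1], CONT[p]]
    return [CONT[p]]                          # the empty-expression node

def evolve(messages, rng):
    """Deliver messages in a random order until no receiver can fire."""
    receivers = set(HEAP)                     # a fresh receiver for every node
    pending = list(messages)                  # messages in flight
    waiting = []                              # processes waiting to synchronise: (char, target)
    while True:
        ready = [m for m in pending if m in receivers]
        if not ready:
            return pending, waiting
        m = rng.choice(ready)
        pending.remove(m)
        receivers.remove(m)                   # receiving consumes the receiver
        if HEAP[m][0] == "chr":
            waiting.append((HEAP[m][1], CONT[m]))
        else:
            pending.extend(outputs(m))

def parallel_match(w, seed=0):
    rng = random.Random(seed)
    messages = [0]                            # wake up the root
    for a in w:
        _, waiting = evolve(messages, rng)
        # Synchronisation on a: matching processes advance, everything else is discarded.
        messages = list({q for (b, q) in waiting if b == a})
    pending, _ = evolve(messages, rng)
    return None in pending                    # a message to null with no input left

print([parallel_match("aab", seed) for seed in range(5)])
```

Varying the seed changes the delivery order but, on this example, not the outcome, in line with the determinacy statement above.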

The correctness proof of the parallel implementation relies on a factorisation of the processes into
four components. At each step i, we have:

• A set $S_i$ of pointers, indicating nodes that should be evolved.

• A heap of receivers $\pi_i \subseteq \pi$, representing nodes that have not been visited in the current macro step.

• A set $E_i$ of evolved nodes, whose process representations are of the form ready to match a character.

• A parallel composition $D_i$ of messages to nodes that have already been processed.

Let E be a set of pointers $E = \{p_1, \ldots, p_n\}$ such that $\pi(p_j) = a_j$ and $\mathit{cont}\ p_j = q_j$. We write

\[
\$E = \$a_1.\overline{q_1} \parallel \cdots \parallel \$a_n.\overline{q_n}
\]

We need to consider transition sequences of the form

\[
\begin{array}{rl}
 & \overline{S_0} \parallel [\![\pi_0]\!] \parallel \$E_0 \parallel D_0 \\
\to & \cdots \\
\to & \overline{S_n} \parallel [\![\pi_n]\!] \parallel \$E_n \parallel D_n
\end{array}
\]

where $\pi_0 = \pi$ and $E_0 = \emptyset$. The invariant we need to establish for all transition steps consists of:

\[
\begin{array}{rcl}
\Box S_0 & = & \Box(S_i \cap \mathrm{dom}(\pi_i)) \cup E_i \\
\Box R_i & \subseteq & \Box(S_i \cap \mathrm{dom}(\pi_i)) \cup E_i \\
\{ p \mid \exists D.\ D_i \equiv (\overline{p} \parallel D) \} & \subseteq & S_i \cup R_i
\end{array}
\]

where $R_i = \mathrm{dom}(\pi) \setminus \mathrm{dom}(\pi_i)$. The factorization of processes at each step and the invariant are verified
by case analysis on the kind of node $\pi(p)$ and hence the possible $\to$ steps that its translation $[\![p]\!]_\pi$ can
make using the rules from Figure 7.1.

In the final configuration we have $S_n \cap \mathrm{dom}(\pi_n) = \emptyset$. Hence,

\[
\begin{array}{rcl}
\Box S_0 & = & \Box(S_n \cap \mathrm{dom}(\pi_n)) \cup E_n \\
 & = & \Box \emptyset \cup E_n \\
 & = & E_n
\end{array}
\]

Therefore, we have $\Box S_0 = E_n$, as required. From that configuration, there can only be an $\xrightarrow{a}$ transition,
exactly matching the generic lockstep transition $S \stackrel{a}{\Rightarrow} S'$.

8 Implementation on a GPU 

As a proof of concept, we have written a simple regular expression matcher where the evolution of
pointers is performed in parallel on a GPU.¹ Programming the GPU was done via CUDA [3]. The main
points are:

• The regular expression is parsed, and the syntax tree nodes are packed into an array d. This array
represents our heap $\pi$. A second pass through the syntax tree performs the wiring of continuation
pointers, corresponding to cont.

• Two integer vectors c, n of the same size as the regular expression vector above are created. Here a
value of t (the macro step count) in c[i] implies that regular expression d[i] is to be simulated within
the current macro step. On the other hand, a value of −t in c[i] implies that the corresponding
regular expression has already been simulated for the current macro step. This protocol realizes
the semantics of a process being consumed once it has received a message. The vector n is used to
collect those search attempts which are able to match the current input character. A value of t + 1
in n[j] indicates that the regular expression d[j] is to be simulated on the next macro step.

• Each regular expression node d[i] is assigned a GPU thread. This GPU thread is responsible for
conditionally simulating the regular expression d[i] at each invocation (depending on the value of c[i]).
While simulating an expression, a GPU thread might schedule another GPU thread / expression
d[j] by setting c[j] to t (this could happen, for example, in the case of $e = e_1 \bullet e_2$). Note that one
thread scheduling another thread via the c vector corresponds to the sending of a message $\overline{p}$ from
one process to another.

• At each invocation of the GPU threads (called a kernel launch in CUDA terminology), each thread
which performs a successful simulation updates either of two shared flags which indicate if there
were more threads activated on the c or n vectors during the current invocation. A macro transition
involves swapping the c and n vectors while incrementing the t counter. It corresponds to the n-way
synchronization transition.

• The initial state of the machine has only d[0], the root node process, scheduled for simulation.

However, note that this description corresponds to a minimalistic GPU-based parallel lockstep machine
and does not yet incorporate any optimizations from the literature [14], such as persistent threads and
task queues.

¹ The code is available at http://www.cs.bham.ac.uk/~hxt/research/regexp.shtml
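The scheduling protocol with the vectors c and n can be illustrated by a sequential simulation. The sketch below is written in Python rather than CUDA and is not the authors' kernel: the inner `for` loop plays the role of the threads of one kernel launch, the array d and the cont table for a**b are assumptions read off from the examples, and the extra slot used to record that null has been reached is a hypothetical device of this sketch, since the text does not say how acceptance is detected. Only the flag for the c vector is modelled.

```python
# Assumed node array for a**b and its continuation table (None stands for null).
D = [("seq", 1, 2), ("star", 3), ("chr", "b"), ("star", 4), ("chr", "a")]
CONT = [None, 2, None, 1, 3]

def gpu_style_match(w):
    size = len(D)
    null = size                          # hypothetical extra slot standing for null
    c = [0] * (size + 1)
    n = [0] * (size + 1)
    t = 1                                # macro step counter
    c[0] = t                             # only the root d[0] is scheduled initially
    for a in list(w) + [None]:           # None marks the end of the input
        more_on_c = True
        while more_on_c:                 # each iteration models one kernel launch
            more_on_c = False
            for i in range(size):        # "thread" i simulates node d[i]
                if c[i] != t:
                    continue             # not scheduled, or already simulated (-t)
                c[i] = -t                # mark as consumed for this macro step
                node = D[i]
                cont = null if CONT[i] is None else CONT[i]
                if node[0] == "chr":
                    if node[1] == a:     # match: schedule continuation for the next step
                        n[cont] = t + 1
                    continue
                if node[0] == "alt":
                    targets = [node[1], node[2]]
                elif node[0] == "seq":
                    targets = [node[1]]
                elif node[0] == "star":
                    targets = [node[1], cont]
                else:
                    targets = [cont]
                for j in targets:
                    if c[j] not in (t, -t):
                        c[j] = t         # "send a message" to node j
                        more_on_c = True
        if a is None:
            return c[null] == t          # null was reached with no input left
        c, n = n, c                      # macro transition: swap the vectors
        t += 1

print(gpu_style_match("aab"), gpu_style_match("aba"))
```

Using the step counter as the marker value means the vectors never need to be cleared between macro steps, since stale entries can never equal the current t or −t.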

9 Conclusions 

We have derived regular expression matchers as abstract machines. In doing so, we have used a number of
concepts and techniques from programming language theory. The EKW machine zooms in on a current
expression while maintaining a continuation for keeping track of what to do next. In that sense, the machine
is a distant relative of machines for interpreting lambda terms, such as the SECD machine [7] or the
CEK machine [6]. On the other hand, regular expressions are a much simpler language to interpret than
lambda calculus, so that continuations can be represented by a single pointer into the tree structure (or to
machine code in Thompson's original implementation). While the idea of continuations as code pointers
is sometimes advanced as a helpful intuition, the representation of continuations in CPS compiling [1]
is more complex, involving an environment pointer as well. To represent pointers and the structures
they build up, we found it convenient to use a small fragment of separation logic [11], given by just the
separating conjunction and the points-to predicate. (They are written as $\otimes$ and $\pi(p) = e$ above, to avoid
clashes with other notation.) A similar use of these connectives to describe trees in the setting of abstract
machines appeared in our earlier work on B+trees [12]. Here we translate a tree-shaped data structure
into a network of processes that communicate in a cascade of messages mirroring the pointers in the
tree structure. The semantics of the processes is inspired by the process algebra literature [8, 9, 10].
One reason why a process algebra is suitable for formalizing the lockstep construction with redundancy
elimination is that receiving processes are eliminated once they have received a message; they are used
linearly, and so are reminiscent of linearly-used continuations [2].

We intend to extend both the process algebra view and our CUDA implementation, while
maintaining a close correspondence between them. Regular expression matching is an instance of irregular
parallel processing [14] on a GPU, which presents some optimization problems. At the moment, the
parallel processing power of the GPU cores is not exercised, as each thread does little more than access
the expression tree and activate threads for other nodes. We expect the load on the GPU cores to become
more significant when more expensive constructs such as back-references (for which matching is known to be NP-hard) are
added to our matching language. It remains to be seen whether a GPU implementation will become
more efficient than a sequential CPU-based one, particularly as the number of GPU cores continues to
increase (it is currently in the hundreds of cores). More generally, the operational semantics and
abstract machine approach may be fruitful for reasoning about other forms of General Purpose Graphics
Processing Unit (GPGPU) programming.

Rathnayake and Thielecke